pacman::p_load(sf, spdep, GWmodel, SpatialML,
tmap, rsample, Metrics, tidyverse,
            httr, jsonlite, rvest, units, matrixStats, ggpubr, car)
Take-Home Exercise 3: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods
01 Overview
1.1 Background
Housing is an essential component of household wealth worldwide. Buying a house has always been a major investment for most people. The price of housing is affected by many factors. Some are global in nature, such as the general economy of a country or the inflation rate. Others are more specific to the properties themselves. These factors can be further divided into structural and locational factors. Structural factors are variables related to the property itself, such as its size, fittings, and tenure. Locational factors are variables related to the neighbourhood of the property, such as proximity to childcare centres, public transport services, and shopping centres.
Conventionally, housing resale price predictive models were built using the Ordinary Least Squares (OLS) method. However, this method fails to take into consideration the spatial autocorrelation and spatial heterogeneity that exist in geographic data sets such as housing transactions. In the presence of spatial autocorrelation, OLS estimation of predictive housing resale pricing models can lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, geographically weighted models were introduced to better calibrate predictive models for housing resale prices.
1.2 Objective
This study aims to calibrate a predictive model for HDB resale prices from July to September 2024 using HDB resale transaction records from 2023.
02 Getting Started
2.1 Data
Aspatial dataset:
HDB Resale Flat Prices provided by Data.gov.sg will be used as the core data set. The study focuses on 3-room flats (2b1b).
Geospatial dataset:
- MP14_SUBZONE_WEB_PL: a polygon feature data providing information of URA 2014 Master Plan Planning Subzone boundary data. It is in ESRI shapefile format. This data set was also downloaded from Data.gov.sg
Locational factors with geographic coordinates:
Downloaded from Data.gov.sg.
Hawker Centre data is a list of hawker centres in Singapore. It is in geojson format.
Parks data is a list of parks in Singapore. It is in geojson format.
Supermarket data is a list of supermarkets in Singapore. It is in geojson format.
CHAS clinics data is a list of CHAS clinics in Singapore. It is in geojson format.
Downloaded from Datamall.lta.gov.sg.
MRT data is a list of MRT/LRT stations in Singapore with the station names and codes. It is in shapefile format.
Bus stops data is a list of bus stops in Singapore. It is in shapefile format.
Locational factors without geographic coordinates:
Downloaded from Data.gov.sg.
Retrieved/Scraped from other sources
CBD coordinates obtained from Google.
Shopping malls data is a list of Shopping malls in Singapore obtained from Wikipedia.
Good primary schools is a list of primary schools ranked by popularity, which can be found at the Local Salary Forum.
2.2 Packages
Here’s an overview of packages:
- sf: Manages spatial vector data (points, lines, polygons) in R.
- spdep: Analyzes spatial dependence and autocorrelation in data.
- GWmodel: Builds geographically weighted models, like GWR and GWGLM.
- SpatialML: Supports spatial machine learning and predictive modeling.
- tmap: Creates static and interactive maps for spatial data visualization.
- rsample: Helps with data splitting for model training and validation.
- Metrics: Provides performance metrics like MSE and RMSE for model evaluation.
- tidyverse: A suite of tools for data manipulation, cleaning, and visualization.
- httr: Handles HTTP requests for web data and APIs.
- jsonlite: Parses JSON data into R objects.
- rvest: Simplifies web scraping in R.
- units: Manages and converts measurement units.
- matrixStats: Efficient functions for matrix and vector stats.
- ggpubr: Adds publication-ready tools to ggplot2.
- car: Tools for regression diagnostics and hypothesis tests.
03 Data
3.1 Import Data
3.1.1 Geospatial Data
URA 2014 Master Plan Planning Subzone boundary data
mpsz = st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source
`C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
Hawker Centre data is a list of hawker centres in Singapore. It is in geojson format.
hawker_center <- st_read("data/geospatial/HawkerCentresGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `HawkerCentresGEOJSON' from data source
`C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial\HawkerCentresGEOJSON.geojson'
using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
Parks data is a list of parks in Singapore. It is in geojson format.
parks <- st_read("data/geospatial/Parks.geojson") %>%
  st_transform(crs = 3414)
Reading layer `Parks' from data source
`C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial\Parks.geojson'
using driver `GeoJSON'
Simple feature collection with 430 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
Supermarket data is a list of supermarkets in Singapore. It is in geojson format.
supermarkets <- st_read("data/geospatial/SupermarketsGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `SupermarketsGEOJSON' from data source
`C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial\SupermarketsGEOJSON.geojson'
using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
MRT data is a list of MRT/LRT stations in Singapore with the station names and codes. It is in shapefile format.
mrt <- st_read(dsn = "data/geospatial", layer = "RapidTransitSystemStation") %>%
st_transform(crs = 3414) %>%
  st_zm()
Reading layer `RapidTransitSystemStation' from data source
`C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial'
using driver `ESRI Shapefile'
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : GDAL
Message 1: Non closed ring detected. To avoid accepting it, set the
OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
Simple feature collection with 231 features and 7 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
The warning indicates that some polygon rings are not closed, so we first identify the invalid geometries and then fix them.
# find invalid geometries
invalid_geom <- mrt[!st_is_valid(mrt), ]
# check how many are invalid
print(nrow(invalid_geom))
[1] 3
print(invalid_geom)
Simple feature collection with 3 features and 7 fields (with 1 geometry empty)
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 26568.45 ymin: 27478.44 xmax: 28080.89 ymax: 37543.25
Projected CRS: SVY21 / Singapore TM
TYP_CD STN_NAM ATTACHEMEN SHAPE_AREA SHAPE_LEN TYP_CD_DES
109 0 <NA> CC29_HBF STN.zip 10146.62 670.0397 MRT
NA NA <NA> <NA> NA NA <NA>
162 0 <NA> <NA> 10070.02 1040.8044 MRT
STN_NAM_DE geometry
109 HARBOURFRONT MRT STATION POLYGON ((26736.44 27495.44...
NA <NA> POLYGON EMPTY
162 UPPER THOMSON MRT STATION POLYGON ((27808.12 37518.2,...
# fix the invalid geometries individually with st_make_valid()
#mrt[!st_is_valid(mrt), ] <- st_make_valid(mrt[!st_is_valid(mrt), ])
I tried st_make_valid() but it did not close the rings. So we inspect the invalid geometries and add a small buffer to close them. A 10 m buffer was chosen because, at the scale of the whole country, a 10 m distortion is acceptable.
invalid_geom <- invalid_geom %>%
  filter(!is.na(STN_NAM_DE))
valid_geom <- invalid_geom %>%
  st_buffer(dist = 10) # add a 10 m buffer to close the rings
# delete the 2 invalid rows from `mrt`
mrt <- mrt %>%
  filter(!(STN_NAM_DE %in% c("HARBOURFRONT MRT STATION", "UPPER THOMSON MRT STATION")))
# append `valid_geom` back to `mrt`
mrt <- bind_rows(mrt, valid_geom)
Inspecting the data, I noticed Ulu Pandan Depot and Mandai Depot, which I had never heard of. A quick check showed that they are depots used for train storage and maintenance (for example, when the green line shuts down) rather than passenger stations. So we filter the data to keep stations only.
mrt <- mrt %>%
filter(str_detect(STN_NAM_DE, "STATION$"))
mrt_sf <- mrt %>%
  select(geometry)
Bus stops
bus_sf <- st_read(dsn = "data/geospatial", layer = "BusStop") %>%
  st_transform(crs = 3414) %>% # transform to EPSG:3414
  st_zm() %>%                  # drop the Z dimension
  select(geometry)
Reading layer `BusStop' from data source
`C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5166 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48285.52 ymax: 52983.82
Projected CRS: SVY21
3.1.2 Aspatial Data
Primary Schools
The primary school information is extracted from the General Information of Schools dataset on Data.gov.sg. I pre-processed it in Excel to drop some columns and to pick out primary and mixed-level schools (so as not to miss any primary schools).
allschools <- read.csv("data/aspatial/Generalinformationofschools.csv")
head(allschools)
 school_name address postal_code
1 ADMIRALTY PRIMARY SCHOOL 11 WOODLANDS CIRCLE 738907
2 AHMAD IBRAHIM PRIMARY SCHOOL 10 YISHUN STREET 11 768643
3 AI TONG SCHOOL 100 Bright Hill Drive 579646
4 ALEXANDRA PRIMARY SCHOOL 2A Prince Charles Crescent 159016
5 ANCHOR GREEN PRIMARY SCHOOL 31 Anchorvale Drive 544969
6 ANDERSON PRIMARY SCHOOL 19 ANG MO KIO AVE 9 569785
mainlevel_code
1 PRIMARY
2 PRIMARY
3 PRIMARY
4 PRIMARY
5 PRIMARY
6 PRIMARY
Top Primary Schools
This data comes from a list of primary schools ranked by popularity, which can be found at the Local Salary Forum. According to Wikipedia, there are about 180 primary schools in Singapore, so the top 15th percentile is around rank 27; I therefore include the top 25 in the study.
topschools_csv <- read.csv("data/aspatial/topschools.csv")
head(topschools_csv)
 school_name
1 Holy Innocents’ Primary School
2 Ai Tong School
3 Nan Chiau Primary School
4 Chongfu School
5 Nanyang Primary School
6 CHIJ Primary (Toa Payoh)
But the geometry is missing, so we need to use an API to get the geometry data. The geocoding code, introduced in Section 3.2.1 below, will be used for this.
HDB Resale Data
resale <- read_csv("data/aspatial/ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv") %>%
  filter(month >= "2023-01" & month <= "2024-09")
Rows: 193496 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (3): floor_area_sqm, lease_commence_date, resale_price
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
3.2 Data Wrangling
3.2.1 One Map API
Since some features have no coordinates or geometry, the API code developed by Prof. Kam Tin Seong from SMU is used here.
get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
#print(i)
r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
}
else {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}
3.2.2 Get the Locations of Resold HDB Flats
resale_tidy <- resale %>%
mutate(address = paste(block,street_name)) %>%
mutate(remaining_lease_yr = as.integer(
str_sub(remaining_lease, 0, 2)))%>%
mutate(remaining_lease_mth = as.integer(
    str_sub(remaining_lease, 9, 11)))
Select the data from 2023 and from July to September 2024, as required by the task, and keep only 3-room resale records as our focus.
# Filter the data from resale_tidy for the year 2023 and for Jul-Sep 2024,
# only keeping records where flat_type is "3 ROOM"
resale_selected <- resale_tidy %>%
filter(
# Select all records from the year 2023 or from Jul-Sep 2024
(str_starts(month, "2023") | month %in% c("2024-07", "2024-08", "2024-09")) &
flat_type == "3 ROOM" # Only keep records where flat_type is "3 ROOM"
)
# Display the filtered results
head(resale_selected)
# A tibble: 6 × 14
month town flat_type block street_name storey_range floor_area_sqm flat_model
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 2023… ANG … 3 ROOM 225 ANG MO KIO… 04 TO 06 67 New Gener…
2 2023… ANG … 3 ROOM 310C ANG MO KIO… 25 TO 27 70 Model A
3 2023… ANG … 3 ROOM 225 ANG MO KIO… 07 TO 09 67 New Gener…
4 2023… ANG … 3 ROOM 319 ANG MO KIO… 04 TO 06 73 New Gener…
5 2023… ANG … 3 ROOM 319 ANG MO KIO… 07 TO 09 73 New Gener…
6 2023… ANG … 3 ROOM 220 ANG MO KIO… 04 TO 06 67 New Gener…
# ℹ 6 more variables: lease_commence_date <dbl>, remaining_lease <chr>,
# resale_price <dbl>, address <chr>, remaining_lease_yr <int>,
# remaining_lease_mth <int>
Use the API above to get the coordinates of the resale HDB properties.
add_list <- sort(unique(resale_selected$address))
coords <- get_coords(add_list)
Append the coordinates back to the resale data.
resale <- resale_selected %>%
left_join(coords, by = c("address" = "address")) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)
write_rds(resale, "data/rds/resale.rds")
Since we need to calculate proximity later, we transform the data from WGS84 to Singapore's projected coordinate system (SVY21, EPSG:3414).
hdb <- read_rds("data/rds/resale.rds")
hdb_sf <- st_as_sf(hdb, coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(crs = 3414)
head(hdb_sf)
Simple feature collection with 6 features and 15 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 28537.68 ymin: 38517.37 xmax: 29564.76 ymax: 38825.23
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 16
month town flat_type block street_name storey_range floor_area_sqm flat_model
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 2023… ANG … 3 ROOM 225 ANG MO KIO… 04 TO 06 67 New Gener…
2 2023… ANG … 3 ROOM 310C ANG MO KIO… 25 TO 27 70 Model A
3 2023… ANG … 3 ROOM 225 ANG MO KIO… 07 TO 09 67 New Gener…
4 2023… ANG … 3 ROOM 319 ANG MO KIO… 04 TO 06 73 New Gener…
5 2023… ANG … 3 ROOM 319 ANG MO KIO… 07 TO 09 73 New Gener…
6 2023… ANG … 3 ROOM 220 ANG MO KIO… 04 TO 06 67 New Gener…
# ℹ 8 more variables: lease_commence_date <dbl>, remaining_lease <chr>,
# resale_price <dbl>, address <chr>, remaining_lease_yr <int>,
# remaining_lease_mth <int>, postal <chr>, geometry <POINT [m]>
3.2.3 Top Primary Schools
These data come from a list of primary schools ranked by popularity, which can be found at the Local Salary Forum. According to Wikipedia, there are about 180 primary schools in Singapore, so the top 15th percentile is around rank 27; I therefore include the top 25 in the study.
First we select the top schools from the complete list. To do that, we need to align the case and format of the school names.
clean_school_name <- function(name) {
  name %>%
    toupper() %>%                         # convert to upper case
    str_replace_all("’", "'") %>%         # replace right curly quote with straight quote
    str_replace_all("‘", "'") %>%         # replace left curly quote with straight quote
    str_replace_all("ST\\.", "SAINT") %>% # replace 'St.' with 'Saint'
    str_replace_all("[^A-Z0-9 ]", "") %>% # strip punctuation
    str_replace_all("PRIMARY SCHOOL|PRIMARY SECTION|PRIMARY", "") %>% # remove descriptive words
    str_squish()                          # collapse extra whitespace
}
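As a quick sanity check of the cleaning rules, here is a base-R re-implementation of the same steps (a hypothetical helper, used only for illustration) applied to one school name from the top-school list:

```r
# Base-R mirror of the cleaning steps above (hypothetical helper, for checking only)
clean_name_base <- function(name) {
  x <- toupper(name)
  x <- gsub("\u2019|\u2018", "'", x)  # curly quotes -> straight quote
  x <- gsub("ST\\.", "SAINT", x)      # 'St.' -> 'Saint'
  x <- gsub("[^A-Z0-9 ]", "", x)      # strip punctuation
  x <- gsub("PRIMARY SCHOOL|PRIMARY SECTION|PRIMARY", "", x)  # drop descriptive words
  trimws(gsub(" +", " ", x))          # squish whitespace
}

clean_name_base("St. Hilda's Primary School")
#> [1] "SAINT HILDAS"
```

The cleaned value matches the "SAINT HILDAS" entry in allschools, so the two lists can now be joined on school_name.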
# clean the column school_name in allschools and topschools
allschools$school_name <- sapply(allschools$school_name, clean_school_name)
topschools_csv$school_name <- sapply(topschools_csv$school_name, clean_school_name)
# keep schools whose cleaned names appear in the top-school list
topschools <- allschools[allschools$school_name %in% topschools_csv$school_name, ]
print(topschools)
 school_name address postal_code
3 AI TONG SCHOOL 100 Bright Hill Drive 579646
9 ANGLOCHINESE SCHOOL JUNIOR 16 WINSTEDT ROAD 227988
10 ANGLOCHINESE SCHOOL 50 BARKER ROAD 309918
24 CATHOLIC HIGH SCHOOL 9 BISHAN STREET 22 579767
33 CHIJ TOA PAYOH 628 Lorong 1 Toa Payoh 319765
34 CHIJ SAINT NICHOLAS GIRLS SCHOOL 501 ANG MO KIO STREET 13 569405
35 CHONGFU SCHOOL 170 YISHUN AVENUE 6 768959
50 FAIRFIELD METHODIST SCHOOL 100 DOVER ROAD 139648
66 HENRY PARK 1 HOLLAND GROVE ROAD 278790
67 HOLY INNOCENTS 5 Lorong Low Koon 536451
81 KONG HWA SCHOOL 350 GUILLEMARD ROAD 399772
83 KUO CHUAN PRESBYTERIAN 8 BISHAN STREET 13 579793
87 MARIS STELLA HIGH SCHOOL 25 MOUNT VERNON ROAD 368051
93 METHODIST GIRLS SCHOOL 11 Blackmore Drive 599986
95 NAN CHIAU 50 ANCHORVALE LINK 545080
96 NAN HUA 30 Jalan Lempeng 128806
97 NANYANG 52 KING'S ROAD 268097
115 PEI CHUN PUBLIC SCHOOL 16 LORONG 7 TOA PAYOH 319320
132 RED SWASTIKA SCHOOL 350 BEDOK NORTH AVENUE 3 469719
137 ROSYTH SCHOOL 21 Serangoon North Avenue 4 555855
138 RULANG 6 JURONG WEST STREET 52 649295
145 SINGAPORE CHINESE GIRLS 190 DUNEARN ROAD 309437
152 SAINT HILDAS 2 Tampines Ave 3 529706
154 SAINT JOSEPHS INSTITUTION JUNIOR 3 ESSEX ROAD 309331
160 TAO NAN SCHOOL 49 MARINE CRESCENT 449761
mainlevel_code
3 PRIMARY
9 PRIMARY
10 PRIMARY
24 MIXED LEVELS
33 PRIMARY
34 MIXED LEVELS
35 PRIMARY
50 PRIMARY
66 PRIMARY
67 PRIMARY
81 PRIMARY
83 PRIMARY
87 MIXED LEVELS
93 PRIMARY
95 PRIMARY
96 PRIMARY
97 PRIMARY
115 PRIMARY
132 PRIMARY
137 PRIMARY
138 PRIMARY
145 PRIMARY
152 PRIMARY
154 PRIMARY
160 PRIMARY
add_list <- sort(unique(topschools$address))
coords_topschools <- get_coords(add_list)
write_rds(coords_topschools, "data/rds/topschools.rds")
topschools <- read_rds("data/rds/topschools.rds")
topschools_sf <- st_as_sf(topschools, coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(crs = 3414)
# drop columns 1 and 2 to keep only the geometry
topschools_sf <- topschools_sf[, -c(1, 2)]
head(topschools_sf)
Simple feature collection with 6 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 22442.59 ymin: 31484.03 xmax: 30450.9 ymax: 38071.92
Projected CRS: SVY21 / Singapore TM
geometry
1 POINT (22544.29 33216.96)
2 POINT (27966.81 38071.92)
3 POINT (22673.28 31484.03)
4 POINT (22442.59 34984.58)
5 POINT (30450.9 35439.72)
6 POINT (28849.34 32406.84)
3.2.4 CBD
The CBD of Singapore encompasses areas like Marina Bay, Tanjong Pagar, and Shenton Way, where much of Singapore's business activity is located. The Geospatial Data Science Lab at the National University of Singapore lists the CBD centre coordinates as 1.2800°N, 103.8500°E. This point is in the Raffles Place area, which is considered the heart of the Singapore CBD.
Build a cbd_sf object just as we did for the schools and HDB data.
cbd <- data.frame(
longitude = 103.8500,
latitude = 1.2800
)
cbd_sf <- st_as_sf(cbd, coords = c("longitude", "latitude"), crs = 4326)
cbd_sf <- st_transform(cbd_sf, crs = st_crs(topschools_sf))
print(cbd_sf)
Simple feature collection with 1 feature and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 29856.51 ymin: 29161.42 xmax: 29856.51 ymax: 29161.42
Projected CRS: SVY21 / Singapore TM
geometry
1 POINT (29856.51 29161.42)
3.2.5 Hawker Centres, Parks, and Supermarkets
For hawker_center, parks, and supermarkets, the geometry type is POINT Z rather than POINT, i.e. 3D points that carry an altitude. We need to drop the Z dimension to align them with the other 2D point layers.
Then we drop the columns we will not use later.
hawker_center
hawker_center_sf <- hawker_center %>%
  st_zm() %>%      # drop the Z dimension
  select(geometry) # keep only the geometry column
head(hawker_center_sf)
Simple feature collection with 6 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 22042.51 ymin: 29650.7 xmax: 35955.52 ymax: 47850.43
Projected CRS: SVY21 / Singapore TM
geometry
1 POINT (29874.82 29650.7)
2 POINT (22042.51 46139.03)
3 POINT (24816.7 31094.91)
4 POINT (32867.9 41500.77)
5 POINT (35955.52 43336.13)
6 POINT (26794.39 47850.43)
Now we check the summary and see that only X and Y are kept, and the CRS is correct.
We repeat this process for Parks and Supermarkets.
parks_sf <- parks %>%
st_zm() %>%
select(geometry)
head(parks_sf)
Simple feature collection with 6 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 19531.82 ymin: 32755.86 xmax: 37908.12 ymax: 39763.61
Projected CRS: SVY21 / Singapore TM
geometry
1 POINT (22197.26 33106.05)
2 POINT (22371.87 32755.86)
3 POINT (22355.91 34333.61)
4 POINT (19531.82 39763.61)
5 POINT (21136.8 35335.69)
6 POINT (37908.12 35176.79)
supermarkets_sf <- supermarkets %>%
  st_zm() %>%      # drop the Z dimension
  select(geometry) # keep only the geometry column
head(supermarkets_sf)
Simple feature collection with 6 features and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 32184.01 ymin: 32947.46 xmax: 41384.47 ymax: 42685.17
Projected CRS: SVY21 / Singapore TM
geometry
1 POINT (35561.22 42685.17)
2 POINT (32184.01 32947.46)
3 POINT (33903.48 39480.46)
4 POINT (37083.82 35017.47)
5 POINT (41320.3 37283.82)
6 POINT (41384.47 37152.14)
3.2.6 Add Proximity to Facilities
The proximity function will be applied to calculate the distance from each property to facilities
proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  df1[, varname] <- rowMins(dist_matrix)
  return(df1)
}
Then we calculate the proximity to the CBD, MRT stations, bus stops, hawker centres, good primary schools, parks, and supermarkets.
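Before applying proximity() to the real layers, the row-minimum logic can be checked on a toy distance matrix (hypothetical numbers, base R only — apply() with min gives the same result as matrixStats::rowMins()):

```r
# Toy distance matrix: 3 properties x 2 facilities, distances in metres
dist_matrix <- matrix(c(120,  80,
                        300, 450,
                         60, 200),
                      nrow = 3, byrow = TRUE)

# Nearest-facility distance per property (what rowMins(dist_matrix) returns)
nearest <- apply(dist_matrix, 1, min)
nearest
#> [1]  80 300  60
```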
Ensure that all the data are in EPSG:3414.
# Ensure that all the data are in EPSG:3414
hdb_sf <- st_transform(hdb_sf, 3414)
cbd_sf <- st_transform(cbd_sf, 3414)
mrt_sf <- st_transform(mrt_sf, 3414)
bus_sf <- st_transform(bus_sf, 3414)
hawker_center_sf <- st_transform(hawker_center_sf, 3414) # use the cleaned *_sf layers
topschools_sf <- st_transform(topschools_sf, 3414)
parks_sf <- st_transform(parks_sf, 3414)
supermarkets_sf <- st_transform(supermarkets_sf, 3414)
hdb_sf <- proximity(hdb_sf, cbd_sf, "PROX_CBD") %>%
proximity(., mrt_sf, "PROX_MRT") %>%
proximity(., bus_sf, "PROX_BUS") %>%
proximity(., hawker_center_sf, "PROX_HAWKER") %>%
proximity(., topschools_sf, "PROX_TOPSCHOOL") %>%
proximity(., parks_sf, "PROX_PARK") %>%
  proximity(., supermarkets_sf, "PROX_MKT")
3.2.7 Add Number of Nearby Facilities
Number of facilities nearby may also contribute to the resale price.
count_facilities_within_radius <- function(df1, facilities, varname, radius) {
# Calculate the distance matrix between df1 and facilities
dist_matrix <- st_distance(df1, facilities) %>%
drop_units() %>%
as.data.frame()
# Count facilities within the specified radius for each location in df1
df1[[paste0(varname, "_count")]] <- rowSums(dist_matrix <= radius)
return(df1)
}
We set the radius to 600 m because, most of the time, you cannot walk in a straight line from your home to a destination. Treating the street block as a square, a straight-line (diagonal) distance of 600 m corresponds to roughly 850 m of walking along the two sides, which is still within about a 10-minute walk for most people.
However, top schools are special: walking distance is not the main consideration for them, and there are only 25 of them among 180-odd primary schools, so we allow a larger radius (1,200 m) for top schools.
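The counting logic in count_facilities_within_radius() is the same distance-matrix idea as before, but with a threshold instead of a minimum. A toy check (hypothetical distances):

```r
# Toy distance matrix: 3 properties x 3 facilities, distances in metres
dist_matrix <- matrix(c(120,  80, 700,
                        300, 450, 550,
                         60, 200, 900),
                      nrow = 3, byrow = TRUE)
radius <- 600

# Number of facilities within the radius for each property
counts <- rowSums(dist_matrix <= radius)
counts
#> [1] 2 3 2
```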
hdb_sf <- count_facilities_within_radius(hdb_sf, mrt_sf, "mrt", 600) %>%
count_facilities_within_radius(., bus_sf, "bus", 600) %>%
count_facilities_within_radius(., hawker_center_sf, "hawker", 600) %>%
count_facilities_within_radius(., topschools_sf, "topschool", 1200) %>%
count_facilities_within_radius(., parks_sf, "park", 600) %>%
  count_facilities_within_radius(., supermarkets_sf, "market", 600)
3.3 EDA
3.3.1 HDB Resale Price Using Statistical Graphics
First we plot the distribution of resale_price of 3 room HDB by using appropriate Exploratory Data Analysis (EDA) as shown in the code chunk below
ggplot(data=hdb_sf, aes(x=`resale_price`)) +
geom_histogram(bins=20, color="black", fill="light blue")
This histogram shows the distribution of HDB resale prices (resale_price) in Singapore. Here is a description of the chart:
Price Distribution: Most resale prices are concentrated between 400,000 and 500,000 SGD, forming a prominent peak.
Long Tail: The distribution has a long right tail, with a small number of properties priced above 800,000 SGD, reaching up to 1,600,000 SGD, indicating a higher price range.
Skewness: The data is right-skewed, with the majority of properties priced on the lower end, and a rapid decrease in frequency as prices increase.
This chart indicates that the vast majority of HDB resale prices are relatively moderate, with only a few high-priced properties.
hdb_sf <- hdb_sf %>%
  mutate(`resale_price` = log(resale_price))
ggplot(data=hdb_sf, aes(x=`resale_price`)) +
geom_histogram(bins=20, color="black", fill="light blue")
This is much better: the distribution is now close to normal. We will use the log-transformed price for the regression later.
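One consequence of modelling log(resale_price) is that predictions come out on the log scale and must be back-transformed with exp() before reporting errors in dollars. A minimal sketch with toy numbers (not from the actual data):

```r
# Hypothetical resale prices (SGD) and pretend log-scale model predictions
actual   <- c(420000, 455000, 508000)
log_pred <- log(actual) + c(0.02, -0.01, 0)  # small errors on the log scale

# Back-transform to SGD before computing dollar-scale error metrics
price_pred <- exp(log_pred)
rmse_sgd <- sqrt(mean((price_pred - actual)^2))
```

A 0.02 error on the log scale is roughly a 2% price error, so the RMSE here lands in the thousands of dollars even though the log-scale errors look tiny.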
3.3.2 Multiple Histogram Plots distribution of variables
floor_area_sqm <- ggplot(data=hdb_sf, aes(x=floor_area_sqm)) +
geom_histogram(bins=20, color="black", fill="lightblue")
remaining_lease_yr <- ggplot(data=hdb_sf, aes(x=remaining_lease_yr)) +
geom_histogram(bins=20, color="black", fill="lightblue")
PROX_CBD <- ggplot(data=hdb_sf, aes(x=PROX_CBD)) +
geom_histogram(bins=20, color="black", fill="lightblue")
PROX_MRT <- ggplot(data=hdb_sf, aes(x=PROX_MRT)) +
geom_histogram(bins=20, color="black", fill="lightblue")
PROX_BUS <- ggplot(data=hdb_sf, aes(x=PROX_BUS)) +
geom_histogram(bins=20, color="black", fill="lightblue")
PROX_HAWKER <- ggplot(data=hdb_sf, aes(x=PROX_HAWKER)) +
geom_histogram(bins=20, color="black", fill="lightblue")
PROX_TOPSCHOOL <- ggplot(data=hdb_sf, aes(x=PROX_TOPSCHOOL)) +
geom_histogram(bins=20, color="black", fill="lightblue")
PROX_PARK <- ggplot(data=hdb_sf, aes(x=PROX_PARK)) +
geom_histogram(bins=20, color="black", fill="lightblue")
PROX_MKT <- ggplot(data=hdb_sf, aes(x=PROX_MKT)) +
geom_histogram(bins=20, color="black", fill="lightblue")
mrt_count <- ggplot(data=hdb_sf, aes(x=mrt_count)) +
geom_histogram(bins=20, color="black", fill="lightblue")
bus_count <- ggplot(data=hdb_sf, aes(x=bus_count)) +
geom_histogram(bins=20, color="black", fill="lightblue")
hawker_count <- ggplot(data=hdb_sf, aes(x=hawker_count)) +
geom_histogram(bins=20, color="black", fill="lightblue")
topschool_count <- ggplot(data=hdb_sf, aes(x=topschool_count)) +
geom_histogram(bins=20, color="black", fill="lightblue")
park_count <- ggplot(data=hdb_sf, aes(x=park_count)) +
geom_histogram(bins=20, color="black", fill="lightblue")
market_count <- ggplot(data=hdb_sf, aes(x=market_count)) +
  geom_histogram(bins=20, color="black", fill="lightblue")
# Arrange all plots in a grid
ggarrange(
floor_area_sqm, remaining_lease_yr, PROX_CBD, PROX_MRT, PROX_BUS,
PROX_HAWKER, PROX_TOPSCHOOL, PROX_PARK, PROX_MKT, mrt_count,
bus_count, hawker_count, topschool_count, park_count, market_count,
ncol = 3, nrow = 5
)
3.3.3 Drawing Statistical Point Map
Reveal the geospatial distribution HDB resale prices in Singapore.
tmap_mode("view")
tmap mode set to interactive viewing
mpsz <- st_make_valid(mpsz)
tm_shape(mpsz)+
tm_polygons() +
tm_shape(hdb_sf) +
tm_dots(col = "resale_price",
alpha = 0.6,
style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode("plot")
tmap mode set to plotting
This map displays the spatial distribution of HDB resale prices across different areas in Singapore. Yellow represents lower resale prices, ranging from approximately 150,000 to 355,000; light brown to dark brown represents higher resale prices, with the darkest brown indicating the highest prices, up to about 1,568,000. Darker colors are concentrated in certain areas, indicating clusters of higher-priced properties.
04 Multiple Linear Regression
4.1 Collinearity Analysis
It is necessary to check correlations and remove highly correlated variables to improve accuracy.
mdata <- hdb_sf %>%
select(resale_price, geometry, month,
floor_area_sqm, remaining_lease_yr,
PROX_CBD, PROX_MRT, PROX_BUS,PROX_HAWKER, PROX_TOPSCHOOL, PROX_PARK, PROX_MKT,
         mrt_count, bus_count, hawker_count, topschool_count, park_count, market_count)
mdata_nogeo <- mdata %>%
st_drop_geometry()
corrplot::corrplot(cor(mdata_nogeo[, 3:17]),
diag = FALSE,
order = "AOE",
tl.pos = "td",
tl.cex = 0.1,
method = "number",
type = "upper")
The correlation matrix above shows that all the correlation values are below 0.65. Hence, there is no sign of multicollinearity. We also check the VIF to double-check.
model <- lm(resale_price ~ floor_area_sqm + remaining_lease_yr + PROX_CBD + PROX_MRT + PROX_BUS +
PROX_HAWKER + PROX_TOPSCHOOL + PROX_PARK + PROX_MKT + mrt_count +
bus_count + hawker_count + topschool_count + park_count + market_count, data = hdb_sf)
vif_values <- vif(model)
print(vif_values)
    floor_area_sqm remaining_lease_yr           PROX_CBD           PROX_MRT
1.034321 1.382169 1.837811 1.848312
PROX_BUS PROX_HAWKER PROX_TOPSCHOOL PROX_PARK
1.061921 1.908965 1.728239 1.739462
PROX_MKT mrt_count bus_count hawker_count
1.305292 1.923894 1.209827 2.175077
topschool_count park_count market_count
1.623932 1.651968 1.216358
The results for vif_values show that all variables have Variance Inflation Factor (VIF) values below 10, indicating that multicollinearity is not a serious issue among these variables. Generally, VIF values below 10 are considered acceptable, and values below 5 indicate almost no multicollinearity.
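For intuition, the VIF of a predictor is just 1/(1 - R²) from regressing that predictor on all the others. A self-contained toy illustration (simulated data, not the HDB variables):

```r
# VIF by hand: for predictor j, VIF_j = 1 / (1 - R^2_j), where R^2_j comes
# from regressing predictor j on all the other predictors
set.seed(42)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.5)  # strongly correlated with x1
x3 <- rnorm(200)                 # independent noise

r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)        # well above 1 because of the x1-x2 link
```

With this construction the theoretical R² of x1 on x2 is about 0.8, giving a VIF near 5; an independent predictor like x3 would sit close to 1, as our HDB variables do.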
Export an rds file of the model data.
write_rds(mdata, "data/model/mdata.rds")
4.2 Data Splitting
mdata <- read_rds("data/model/mdata.rds")
The data are split into training and test sets. A random 65%/35% split could be done with initial_split() of the rsample package (rsample is one of the packages of tidymodels), but since the task is to predict July-September 2024 prices from 2023 transactions, we split by time instead: 2023 records form the training set and July-September 2024 records form the test set.
# Filter 2023 data for training
train_data <- hdb_sf %>%
filter(month >= "2023-01" & month <= "2023-12")
# Filter July to September 2024 data for testing
test_data <- hdb_sf %>%
  filter(month >= "2024-07" & month <= "2024-09")
write_rds(train_data, "data/model/train_data.rds")
write_rds(test_data, "data/model/test_data.rds")
train_data <- read_rds("data/model/train_data.rds")
test_data <- read_rds("data/model/test_data.rds")
4.3 Building a Non-spatial Multiple Linear Regression
price_mlr <- lm(resale_price ~ floor_area_sqm +
remaining_lease_yr + PROX_CBD + PROX_HAWKER +
PROX_MRT + PROX_PARK + PROX_MKT + mrt_count +
bus_count + hawker_count + topschool_count +
park_count + market_count,
data = hdb_sf)
summary(price_mlr)
Call:
lm(formula = resale_price ~ floor_area_sqm + remaining_lease_yr +
PROX_CBD + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_MKT +
mrt_count + bus_count + hawker_count + topschool_count +
park_count + market_count, data = hdb_sf)
Residuals:
Min 1Q Median 3Q Max
-1.58745 -0.06610 -0.00299 0.06146 0.67886
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.184e+01 1.505e-02 786.511 < 2e-16 ***
floor_area_sqm 1.009e-02 1.737e-04 58.106 < 2e-16 ***
remaining_lease_yr 1.004e-02 7.861e-05 127.688 < 2e-16 ***
PROX_CBD -1.964e-05 3.326e-07 -59.066 < 2e-16 ***
PROX_HAWKER -2.555e-05 3.745e-06 -6.821 9.65e-12 ***
PROX_MRT -1.002e-04 4.481e-06 -22.363 < 2e-16 ***
PROX_PARK -4.726e-05 4.042e-06 -11.693 < 2e-16 ***
PROX_MKT 7.080e-05 6.577e-06 10.764 < 2e-16 ***
mrt_count -4.537e-03 1.364e-03 -3.327 0.000882 ***
bus_count 1.759e-03 2.378e-04 7.397 1.53e-13 ***
hawker_count -8.176e-03 1.547e-03 -5.286 1.28e-07 ***
topschool_count 1.134e-02 2.045e-03 5.546 3.01e-08 ***
park_count -3.442e-03 1.099e-03 -3.133 0.001737 **
market_count 1.019e-02 7.765e-04 13.119 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1022 on 8249 degrees of freedom
Multiple R-squared: 0.7332, Adjusted R-squared: 0.7328
F-statistic: 1744 on 13 and 8249 DF, p-value: < 2.2e-16
The linear regression model predicts resale_price (which appears to be log-transformed, given its values range roughly from 11.9 to 14.0) using structural variables such as floor_area_sqm and remaining_lease_yr, distance-based proximity measures to amenities (PROX_CBD, PROX_HAWKER, PROX_MRT, etc.), and facility counts (mrt_count, bus_count, etc.). The model explains approximately 73.3% of the variance in resale_price (Multiple R-squared = 0.7332, Adjusted R-squared = 0.7328), indicating a strong fit.
All coefficients are statistically significant at the 1% level, suggesting that the predictors have a meaningful association with resale_price. For instance: - Positive influence: floor_area_sqm and remaining_lease_yr both raise resale prices, as expected. - Distance effects: because the PROX_ variables measure distance to each amenity, their negative coefficients (PROX_CBD, PROX_MRT, PROX_PARK, PROX_HAWKER) mean prices fall as distance increases — that is, flats closer to the CBD, MRT stations, parks and hawker centres command higher prices. PROX_MKT, by contrast, carries a small positive coefficient.
The residual standard error of 0.1022 (on the log-price scale) is small, indicating that the model's predictions align closely with observed values. The overall F-statistic of 1744 (p < 2.2e-16) confirms the model is statistically significant.
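Because the response is on the log scale, each coefficient has a proportional interpretation: a one-unit increase in a predictor multiplies the price by exp(beta). A quick worked example in base R using the remaining_lease_yr coefficient from the summary above:

```r
# With log(resale_price) as the response, a coefficient beta means a one-unit
# change in the predictor multiplies the price by exp(beta).
beta <- 1.004e-02                    # remaining_lease_yr coefficient above
pct_per_year <- (exp(beta) - 1) * 100
round(pct_per_year, 2)               # ~1.01% higher price per extra lease year
```

For small coefficients, exp(beta) - 1 is close to beta itself, which is why log-model coefficients are often read directly as percentage effects.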
4.4 Basic GWR Predictive Model
Convert the sf data.frame to a SpatialPointsDataFrame, then calibrate the GWR-based hedonic pricing model using an adaptive bandwidth and a Gaussian kernel, as shown in the code chunks below.
train_data_sp <- as_Spatial(train_data)
train_data_sp
class : SpatialPointsDataFrame
features : 6355
extent : 11646.69, 45192.3, 28097.64, 48622.47 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables : 28
names : month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease, resale_price, address, remaining_lease_yr, remaining_lease_mth, postal, ...
min values : 2023-01, ANG MO KIO, 3 ROOM, 1, ALJUNIED CRES, 01 TO 03, 52, DBSS, 1966, 42 years 01 month, 11.9183905730784, 1 BEACH RD, 42, 1, 050004, ...
max values : 2023-12, YISHUN, 3 ROOM, 999B, YUAN CHING RD, 40 TO 42, 155, Terrace, 2020, 95 years 06 months, 13.9978321147582, 999B BUANGKOK CRES, 95, 11, 824306, ...
Compute adaptive bandwidth
# Perform adaptive bandwidth selection for geographically weighted regression
bw_adaptive <- bw.gwr(resale_price ~ floor_area_sqm +
remaining_lease_yr +
PROX_CBD + PROX_MRT + PROX_BUS +
PROX_HAWKER + PROX_TOPSCHOOL + PROX_PARK + PROX_MKT +
mrt_count + bus_count + hawker_count + topschool_count + park_count + market_count,
data =train_data,
approach = "CV",
kernel = "gaussian",
adaptive = TRUE,
longlat = FALSE)
The result shows that 46 neighbour points is the optimal adaptive bandwidth for this data set.
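With an adaptive bandwidth, the kernel radius at each regression point is set by the distance to its 46th nearest neighbour, so it widens in sparse areas and narrows in dense ones. A minimal base-R sketch of the idea on synthetic coordinates (not the HDB data, and a simplified form of GWmodel's Gaussian kernel):

```r
# Adaptive Gaussian kernel weights at one regression point: the bandwidth is
# the distance to the 46th nearest neighbour of that point.
set.seed(42)
coords <- cbind(runif(500, 0, 1000), runif(500, 0, 1000))  # fake XY in metres
i <- 1                                                     # regression point
d <- sqrt(rowSums((coords - matrix(coords[i, ], 500, 2, byrow = TRUE))^2))
bw <- sort(d)[47]                 # 46 neighbours (sort(d)[1] is the point itself)
w  <- exp(-0.5 * (d / bw)^2)      # Gaussian kernel: weight decays with distance
sum(w > 0.01)                     # only points near the bandwidth carry real weight
```

The weight at the regression point itself is 1, and distant observations contribute almost nothing, which is what makes the local fits genuinely local.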
write_rds(bw_adaptive, "data/model/bw_adaptive.rds")
4.5 Preparing coordinates data
Extracting coordinates data
coords <- st_coordinates(mdata)
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)
coords_train <- write_rds(coords_train, "data/model/coords_train.rds")
coords_test <- write_rds(coords_test, "data/model/coords_test.rds")
Dropping the geometry field
train_data <- train_data %>%
st_drop_geometry()
4.6 Geographical Random Forest Model
set.seed(1234)
gwRF_adaptive <- grf(formula = resale_price ~ floor_area_sqm +
remaining_lease_yr +
PROX_CBD + PROX_MRT + PROX_BUS +
PROX_HAWKER + PROX_TOPSCHOOL + PROX_PARK + PROX_MKT +
mrt_count + bus_count + hawker_count + topschool_count + park_count + market_count,
dframe=train_data,
bw=46,
kernel="adaptive",
coords=coords_train)
Model Summary:
Residuals Analysis:
OOB Residuals: Median and mean values close to zero, with a slight range in values, indicating low residual error in predictions.
Predicted Residuals (Non-OOB): Shows a similar trend with a low median and mean close to zero, suggesting minimal bias.
Local Variable Importance: The summary mentions local variable importance, which likely provides insights into how variable importance varies spatially.
Overall Metrics for Local Model:
MSE (OOB): 0.004 and R-squared (OOB): 88.98%, which is consistent with the global model’s performance.
Predicted MSE (Not OOB): 0.002, with an R-squared of 95.72% on hold-out data, indicating that the model generalizes well.
AIC/AICc: Both metrics are lower on the hold-out data than the OOB data, indicating a better fit on the validation data.
Interpretation
This geographically weighted random forest model performs well, with a high R-squared, low MSE, and significant variable importance concentrated on key factors like remaining_lease_yr, floor_area_sqm, and PROX_CBD. The model shows strong predictive power and generalizes well to unseen data (as indicated by the performance on non-OOB data).
write_rds(gwRF_adaptive, "data/model/gwRF_adaptive.rds")
gwRF_adaptive <- read_rds("data/model/gwRF_adaptive.rds")
4.7 Predicting by using test data
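predict.grf() of SpatialML combines each local forest's prediction with the global forest's prediction as a weighted average controlled by local.w and global.w; the call below uses local.w=1 and global.w=0, i.e. purely local predictions. A toy base-R illustration of the blend with made-up numbers:

```r
# predict.grf() forms: blended = local.w * local_pred + global.w * global_pred
# Hypothetical predictions for one test flat (log-price scale, made-up values):
local_pred  <- 12.90   # from the local forest nearest the flat
global_pred <- 12.70   # from the single global forest
local.w  <- 1
global.w <- 0
blended <- local.w * local_pred + global.w * global_pred
blended  # with these weights only the local forest contributes
```

Intermediate weights (e.g. 0.5/0.5) would hedge between local detail and global stability.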
test_data <- cbind(test_data, coords_test) %>%
st_drop_geometry()
gwRF_pred <- predict.grf(gwRF_adaptive,
test_data,
x.var.name="X",
y.var.name="Y",
local.w=1,
global.w=0)
GRF_pred <- write_rds(gwRF_pred, "data/model/GRF_pred.rds")
GRF_pred <- read_rds("data/model/GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)
Append the predicted values back to the test data frame:
test_data_p <- cbind(test_data, GRF_pred_df)
write_rds(test_data_p, "data/model/test_data_p.rds")
4.8 Calculating Root Mean Square Error
The root mean square error (RMSE) allows us to measure how far predicted values are from observed values in a regression analysis. In the code chunk below, rmse() of Metrics package is used to compute the RMSE.
rmse(test_data_p$resale_price,
test_data_p$GRF_pred)
[1] 0.1438254
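rmse() of Metrics computes sqrt(mean((actual - predicted)^2)); a quick base-R check of the formula on toy numbers (not the HDB data). Note that because resale_price appears to be log-transformed, the RMSE of ~0.144 reported above is an error on the log-price scale, not in dollars:

```r
# RMSE = sqrt(mean((actual - predicted)^2)), computed by hand on toy values.
actual    <- c(12.0, 12.5, 13.0)
predicted <- c(12.1, 12.4, 13.2)
rmse_by_hand <- sqrt(mean((actual - predicted)^2))
round(rmse_by_hand, 4)  # 0.1414
```

Because the errors are squared before averaging, RMSE penalises large misses more heavily than mean absolute error would.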
4.9 Visualising the predicted values
Visualise the actual resale price and the predicted resale price.
ggplot(data = test_data_p,
aes(x = GRF_pred,
y = resale_price)) +
geom_point()
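A 45-degree reference line makes over- and under-prediction easier to read off a scatter like the one above: points below the line are over-predicted, points above are under-predicted. A sketch using synthetic stand-in data (the real plot would use test_data_p with GRF_pred and resale_price):

```r
library(ggplot2)

# Synthetic stand-in for test_data_p, for illustration only
set.seed(1234)
demo <- data.frame(GRF_pred = rnorm(100, mean = 12.8, sd = 0.3))
demo$resale_price <- demo$GRF_pred + rnorm(100, sd = 0.14)

p <- ggplot(demo, aes(x = GRF_pred, y = resale_price)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0,
              colour = "red", linetype = "dashed")  # perfect-prediction line
p
```

The tighter the cloud hugs the dashed line, the better the model's calibration across the price range.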
05 Conclusion
This study provides an enhanced approach to predicting HDB resale prices by addressing the spatial characteristics often ignored in conventional models. Housing, as a significant component of household wealth, requires accurate predictive models to support better decision-making for both buyers and policymakers. Traditional models, such as Ordinary Least Squares (OLS), fail to account for the inherent spatial autocorrelation and heterogeneity in housing data, potentially leading to biased and inconsistent predictions. By adopting geographically weighted methods, culminating in a geographically weighted random forest model, this study leverages the spatial nuances in housing data to improve the accuracy and reliability of price predictions.
Through an analysis of HDB resale transaction records from 2023, we calibrated a geographically weighted random forest model to forecast resale prices for July to September 2024. The model integrates both structural factors (such as floor area and remaining lease duration) and locational factors (such as proximity to MRT stations, schools, and markets) to capture the key determinants of housing prices. The results demonstrate that structural factors like remaining lease years and floor area have a strong positive influence on resale prices, while locational factors, particularly proximity to the Central Business District (CBD) and MRT stations, also significantly impact prices. These findings align with expectations that both the quality and accessibility of a property contribute to its market value.
The geographically weighted model achieved high explanatory power, with an R-squared above 88% on out-of-bag data, indicating that the spatially adaptive approach successfully captures both local and global influences on housing prices. Compared to traditional OLS methods, this model provides a more nuanced understanding of the spatial heterogeneity in housing markets, accommodating variations in price determinants across different neighborhoods.
In summary, this study demonstrates the effectiveness of Geographically Weighted Models in improving predictive accuracy for HDB resale prices by accounting for spatial factors. These findings underscore the importance of incorporating both structural and locational variables in housing price models, providing valuable insights for stakeholders in housing markets. For future research, expanding the model with additional neighborhood characteristics or exploring temporal dynamics could further refine predictive accuracy and support more comprehensive housing market analysis.